Diffusion Model
Diffusion Models are a class of generative models that simulate the process of data being gradually corrupted by noise (forward diffusion), and then learn the reverse process to recover data from noise.
1. Core Idea
Diffusion models consist of two core processes:
| Process | Direction | Description |
|---|---|---|
| Forward Diffusion | Data $\to$ Noise | Gradually add Gaussian noise to data until it becomes pure noise |
| Reverse Denoising | Noise $\to$ Data | Learn to gradually recover original data from noise |
[!NOTE] Physical Analogy
Similar to the diffusion phenomenon in thermodynamics: a drop of ink in water gradually diffuses until it is uniformly distributed. The reverse process amounts to “condensing” the uniform distribution back into the initial state.
2. Forward Diffusion Process
2.1 Discrete-Time Formulation (DDPM)
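Given a data point $x_0 \sim q(x_0)$, the forward process adds Gaussian noise step by step:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$$
where $\beta_t \in (0, 1)$ is the noise schedule.
After $T$ steps, $x_T$ is approximately distributed as $\mathcal{N}(0, I)$.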
Common Noise Schedules:
| Schedule | Formula | Characteristics |
|---|---|---|
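| Linear | $\beta_t = \beta_1 + \frac{t-1}{T-1}(\beta_T - \beta_1)$ | Simple, widely used |
| Cosine | $\bar{\alpha}_t = \frac{f(t)}{f(0)},\ f(t) = \cos^2\!\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)$ | Gentler noising at small $t$ |
| Quadratic | $\beta_t = \left(\sqrt{\beta_1} + \frac{t-1}{T-1}\left(\sqrt{\beta_T} - \sqrt{\beta_1}\right)\right)^2$ | Slower initial noise |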
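The three schedules can be sketched in a few lines. This is a minimal illustration in plain Python. The endpoint values `beta_1=1e-4`, `beta_T=0.02`, the cosine offset `s=0.008`, and the clipping value `max_beta=0.999` are commonly used defaults assumed here, not values stated in this note.

```python
import math

def linear_betas(T, beta_1=1e-4, beta_T=0.02):
    """Linearly interpolate beta from beta_1 (t=1) to beta_T (t=T)."""
    return [beta_1 + (beta_T - beta_1) * i / (T - 1) for i in range(T)]

def quadratic_betas(T, beta_1=1e-4, beta_T=0.02):
    """Interpolate sqrt(beta) linearly, then square the result."""
    lo, hi = math.sqrt(beta_1), math.sqrt(beta_T)
    return [(lo + (hi - lo) * i / (T - 1)) ** 2 for i in range(T)]

def cosine_betas(T, s=0.008, max_beta=0.999):
    """Derive betas from a squared-cosine alpha_bar curve, clipped at max_beta."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [min(1.0 - f(t) / f(t - 1), max_beta) for t in range(1, T + 1)]

def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out
```

Plotting `alpha_bars(...)` for each schedule is a quick way to compare how fast the signal is destroyed.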
2.2 Reparameterization Trick
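Define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. Then
$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1 - \bar{\alpha}_t) I\right)$$
Key Property: We can sample $x_t$ at any timestep directly from $x_0$, without simulating the intermediate steps:
$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$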
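A minimal sketch of one-shot noising, $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon$, in plain Python. The linear schedule with $T = 1000$ and the helper names `alpha_bars` and `q_sample` are assumptions made for illustration.

```python
import math
import random

def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, abar, eps):
    """Draw x_t from x_0 in one shot; t is 1-indexed."""
    a = abar[t - 1]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # assumed linear schedule
abar = alpha_bars(betas)

rng = random.Random(0)
x0 = [1.0, -2.0, 0.5]
eps = [rng.gauss(0.0, 1.0) for _ in x0]
x_500 = q_sample(x0, 500, abar, eps)  # no need to simulate steps 1..499
```

Because the same `eps` is returned to the caller, it can be reused as the regression target during training.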
2.3 Continuous-Time Formulation ([[Stochastic Differential Equation (SDE)|SDE]])
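The forward process can be written as a [[Stochastic Differential Equation (SDE)|Stochastic Differential Equation]]:
$$dx = f(x, t)\,dt + g(t)\,dw$$
where $f(x, t)$ is the drift coefficient, $g(t)$ is the diffusion coefficient, and $w$ is a standard Wiener process.
Discrete-Continuous Correspondence: in the limit of infinitely many steps, the DDPM forward process corresponds to the variance-preserving (VP) SDE $dx = -\tfrac{1}{2}\beta(t)\,x\,dt + \sqrt{\beta(t)}\,dw$.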
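A variance-preserving SDE of the form $dx = -\tfrac{1}{2}\beta(t)\,x\,dt + \sqrt{\beta(t)}\,dw$ can be simulated with the Euler–Maruyama scheme. The sketch below is one-dimensional and assumes a linear $\beta(t)$ from 0.1 to 20 on $t \in [0, 1]$, a common choice that this note does not itself specify.

```python
import math
import random

def beta(t, beta_min=0.1, beta_max=20.0):
    """Assumed linear noise rate on t in [0, 1]."""
    return beta_min + t * (beta_max - beta_min)

def simulate_vp_sde(x0, n_steps=100, rng=None):
    """Euler-Maruyama simulation of dx = -0.5*beta(t)*x dt + sqrt(beta(t)) dw."""
    rng = rng or random.Random()
    dt = 1.0 / n_steps
    x = x0
    for i in range(n_steps):
        b = beta(i * dt)
        drift = -0.5 * b * x * dt
        diffusion = math.sqrt(b * dt) * rng.gauss(0.0, 1.0)
        x = x + drift + diffusion
    return x
```

Running many trajectories from the same `x0` should give terminal values with mean near 0 and variance near 1, which mirrors the discrete statement that $x_T$ is approximately standard normal.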
3. Reverse Denoising Process
3.1 Discrete-Time Formulation
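The reverse process is also modeled as a Gaussian distribution:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$
The network learns the mean $\mu_\theta$ and, optionally, the variance $\Sigma_\theta$ (DDPM fixes it to $\sigma_t^2 I$).
Optimal Reverse Distribution (when $x_0$ is known):
$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right)$$
$$\tilde{\mu}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\,x_t, \qquad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\,\beta_t$$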
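The closed-form posterior $q(x_{t-1} \mid x_t, x_0)$ is a Gaussian whose mean is a weighted sum of $x_0$ and $x_t$. A scalar sketch follows. The schedule and the function name `posterior_params` are illustrative assumptions.

```python
import math

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # assumed linear schedule
abar, prod = [], 1.0
for b in betas:
    prod *= 1.0 - b
    abar.append(prod)

def posterior_params(x0, xt, t, betas, abar):
    """Mean and variance of q(x_{t-1} | x_t, x_0); t is 1-indexed and t >= 2."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    ab_t, ab_prev = abar[t - 1], abar[t - 2]
    coef_x0 = math.sqrt(ab_prev) * beta_t / (1.0 - ab_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - ab_prev) / (1.0 - ab_t)
    mean = coef_x0 * x0 + coef_xt * xt
    var = (1.0 - ab_prev) / (1.0 - ab_t) * beta_t
    return mean, var
```

A useful sanity check: if $x_t$ carries no noise, i.e. $x_t = \sqrt{\bar{\alpha}_t}\,x_0$, the posterior mean reduces to $\sqrt{\bar{\alpha}_{t-1}}\,x_0$, and the posterior variance is always smaller than $\beta_t$.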
3.2 Simplified Training Objective (DDPM)
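Ho et al. (2020) proposed a simplified loss function:
$$L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon,\ t\right) \right\|^2\right]$$
- $\epsilon$: true noise
- $\epsilon_\theta$: noise predicted by the neural network
Full Variational Lower Bound:
$$L_{\text{vlb}} = \mathbb{E}_q\left[ D_{\text{KL}}\!\left(q(x_T \mid x_0)\,\|\,p(x_T)\right) + \sum_{t=2}^{T} D_{\text{KL}}\!\left(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\right) - \log p_\theta(x_0 \mid x_1) \right]$$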
3.3 Continuous-Time Formulation (Score-based)
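Reverse [[Stochastic Differential Equation (SDE)|SDE]]:
$$dx = \left[f(x, t) - g(t)^2\,\nabla_x \log p_t(x)\right]dt + g(t)\,d\bar{w}$$
where $\nabla_x \log p_t(x)$ is the [[Score Function]] of the marginal density at time $t$ and $\bar{w}$ is a Wiener process running backward in time.
[[Score Function]] Estimation: a network $s_\theta(x, t)$ approximates the score, $s_\theta(x, t) \approx \nabla_x \log p_t(x)$.
The [[Score Function]] is learned via score matching:
$$\min_\theta\ \mathbb{E}_t\left[\lambda(t)\,\mathbb{E}_{x_0}\,\mathbb{E}_{x_t \mid x_0}\left\| s_\theta(x_t, t) - \nabla_{x_t} \log p_{0t}(x_t \mid x_0) \right\|^2\right]$$
where $\lambda(t)$ is a positive weighting function and $p_{0t}(x_t \mid x_0)$ is the forward transition kernel.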
4. Core Formula Summary
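[!QUOTE] DDPM Forward Noising
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$$
[!QUOTE] DDPM Simplified Loss
$$L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2\right]$$
[!QUOTE] Forward [[Stochastic Differential Equation (SDE)|SDE]]
$$dx = f(x, t)\,dt + g(t)\,dw$$
[!QUOTE] Reverse [[Stochastic Differential Equation (SDE)|SDE]]
$$dx = \left[f(x, t) - g(t)^2\,\nabla_x \log p_t(x)\right]dt + g(t)\,d\bar{w}$$
[!QUOTE] [[Probability Flow ODE]]
$$dx = \left[f(x, t) - \tfrac{1}{2}\,g(t)^2\,\nabla_x \log p_t(x)\right]dt$$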
5. Main Variants
| Model | Features | Key Contributions |
|---|---|---|
| DDPM | Discrete-time, pixel space | Established the basic framework of diffusion models |
| DDIM | Deterministic sampling, accelerated generation | Non-Markovian forward process, supports skip-step sampling |
| Score [[Stochastic Differential Equation (SDE)|SDE]] | Continuous-time [[Stochastic Differential Equation (SDE)|SDE]] framework | Unified DDPM and Score Matching |
| LDM | Latent space diffusion | Perform diffusion in VAE latent space, reducing computation |
| DiT | Transformer architecture | Use Transformer instead of U-Net |
| EDM | Improved design choices | Better architecture, sampling, and training |
| Stable Diffusion | Text-conditional LDM | Cross-attention for text guidance, widely adopted |
5.1 DDIM (Denoising Diffusion Implicit Models)
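DDIM generalizes DDPM to non-Markovian processes:
$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\left(\frac{x_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}\right) + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2}\;\epsilon_\theta(x_t, t) + \sigma_t\,\epsilon$$
where $\sigma_t = \eta\,\sqrt{\frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}}\,\sqrt{1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}}$ and $\epsilon \sim \mathcal{N}(0, I)$:
- $\eta = 1$: DDPM (stochastic)
- $\eta = 0$: DDIM (deterministic)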
Key Advantage: Can use fewer timesteps (e.g., 50 instead of 1000) for faster sampling.
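Skip-step sampling can be illustrated on a toy problem where the ideal noise predictor is known in closed form. The sketch below assumes one-dimensional Gaussian data $\mathcal{N}(2, 0.5^2)$, a 200-step linear schedule, and a hypothetical `eps_model` that stands in for a trained network. None of these come from the DDIM paper.

```python
import math
import random

T = 200
betas = [5e-4 + (0.1 - 5e-4) * i / (T - 1) for i in range(T)]  # assumed schedule
abar, prod = [], 1.0
for b in betas:
    prod *= 1.0 - b
    abar.append(prod)

M, S = 2.0, 0.5  # toy data distribution N(M, S^2)

def eps_model(xt, t):
    """Closed-form E[eps | x_t] for Gaussian data (stand-in for a network)."""
    ab = abar[t - 1]
    var_t = ab * S * S + (1.0 - ab)
    return math.sqrt(1.0 - ab) * (xt - math.sqrt(ab) * M) / var_t

def ddim_sample(x_T, n_steps=20, eta=0.0, rng=None):
    """Sample with a sub-sequence of timesteps; eta=0 is deterministic."""
    ts = [T - i * (T // n_steps) for i in range(n_steps)]  # e.g. 200, 190, ..., 10
    x = x_T
    for i, t in enumerate(ts):
        t_prev = ts[i + 1] if i + 1 < len(ts) else 0
        ab_t = abar[t - 1]
        ab_prev = abar[t_prev - 1] if t_prev > 0 else 1.0
        eps = eps_model(x, t)
        x0_pred = (x - math.sqrt(1.0 - ab_t) * eps) / math.sqrt(ab_t)
        sigma = eta * math.sqrt((1.0 - ab_prev) / (1.0 - ab_t)) * math.sqrt(1.0 - ab_t / ab_prev)
        z = rng.gauss(0.0, 1.0) if (rng is not None and eta > 0) else 0.0
        direction = math.sqrt(max(1.0 - ab_prev - sigma ** 2, 0.0)) * eps
        x = math.sqrt(ab_prev) * x0_pred + direction + sigma * z
    return x
```

With `eta=0.0`, the same starting noise always maps to the same sample, which is what makes latent interpolation and inversion possible.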
5.2 LDM (Latent Diffusion Models)
Instead of diffusing in pixel space, LDM operates in latent space:
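- Compress: $z = \mathcal{E}(x)$ using VAE encoder
- Diffuse: Apply diffusion process to $z$
- Decode: $\hat{x} = \mathcal{D}(z)$ using VAE decoder
Benefits:
- Lower dimensionality (e.g., $64 \times 64 \times 4$ latents vs. $512 \times 512 \times 3$ pixels)
- Faster training and inference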
- Perceptual compression preserves semantic information
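The dimensionality saving is easy to quantify. The shapes below (a $512 \times 512 \times 3$ image and a $64 \times 64 \times 4$ latent) are an assumed example matching a common latent-diffusion configuration.

```python
import math

def compression_ratio(image_shape, latent_shape):
    """Ratio of element counts between pixel space and latent space."""
    return math.prod(image_shape) / math.prod(latent_shape)

ratio = compression_ratio((512, 512, 3), (64, 64, 4))
print(ratio)  # 48.0
```

Each denoising step therefore processes 48 times fewer values than it would in pixel space, before accounting for any architectural differences.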
5.3 DiT (Diffusion Transformers)
Replace U-Net with Transformer architecture:
- Patching: Split image into patches (like ViT)
- Self-attention: Capture long-range dependencies
- Scaling: Better performance with larger models
- Flexibility: Easy to incorporate conditioning
Result: DiT-XL/2 outperformed earlier U-Net-based diffusion models on class-conditional ImageNet generation.
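The patching step can be sketched without any framework. This example assumes a single-channel $32 \times 32$ latent and patch size 2. Real DiT models work on multi-channel tensors and add a linear projection plus positional embeddings, which are omitted here.

```python
def patchify(image, p):
    """Split an H x W grid (list of rows) into flattened p x p patches."""
    H, W = len(image), len(image[0])
    assert H % p == 0 and W % p == 0, "patch size must divide both dimensions"
    tokens = []
    for i in range(0, H, p):
        for j in range(0, W, p):
            tokens.append([image[i + di][j + dj] for di in range(p) for dj in range(p)])
    return tokens

latent = [[r * 32 + c for c in range(32)] for r in range(32)]
tokens = patchify(latent, 2)
print(len(tokens), len(tokens[0]))  # 256 4
```

Halving the patch size quadruples the token count, which is why the "/2" models are the most expensive variants at a given width.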
6. Training and Sampling Algorithms
6.1 Training Loop
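A minimal sketch of the training loop: sample data, a timestep, and noise, form $x_t$, then regress the noise. To stay dependency-free it uses one-dimensional toy data $\mathcal{N}(2, 0.5^2)$ and a hypothetical per-timestep linear predictor $\epsilon_\theta(x_t, t) = w_t x_t + b_t$ with hand-written gradients. A real implementation would use a neural network and an autodiff library.

```python
import math
import random

T = 50
betas = [1e-3 + (0.2 - 1e-3) * i / (T - 1) for i in range(T)]  # assumed schedule
abar, prod = [], 1.0
for beta_t in betas:
    prod *= 1.0 - beta_t
    abar.append(prod)

def train(steps=20000, lr=0.02, seed=0):
    """SGD on L_simple for a toy model eps_theta(x_t, t) = w[t] * x_t + b[t]."""
    rng = random.Random(seed)
    w, b = [0.0] * T, [0.0] * T
    losses = []
    for _ in range(steps):
        x0 = rng.gauss(2.0, 0.5)        # 1. sample a data point
        t = rng.randrange(T)            # 2. sample a timestep uniformly (0-indexed)
        eps = rng.gauss(0.0, 1.0)       # 3. sample the noise
        xt = math.sqrt(abar[t]) * x0 + math.sqrt(1.0 - abar[t]) * eps  # 4. noise the data
        err = (w[t] * xt + b[t]) - eps  # 5. prediction error
        losses.append(err * err)        #    L_simple for this sample
        w[t] -= lr * 2.0 * err * xt     # 6. gradient step on w[t]
        b[t] -= lr * 2.0 * err          #    and on b[t]
    return w, b, losses
```

Calling `train()` returns the learned parameters and the per-step losses. The running average of the loss should fall well below its initial value of about 1.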
Training Tricks:
- t weighting: Weight the loss by a timestep-dependent factor $\lambda(t)$, or use uniform weighting
- Architecture: U-Net with attention, group normalization, SiLU activation
- EMA: Exponential moving average of model weights for better sampling
- Dropout: Apply to attention layers for regularization
6.2 Sampling Loop (DDPM)
```python
x_T = sample_normal(0, I)
```
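Starting from $x_T \sim \mathcal{N}(0, I)$, DDPM ancestral sampling repeatedly removes the predicted noise and re-injects a smaller amount of fresh noise. The sketch below is one-dimensional and assumes toy Gaussian data $\mathcal{N}(2, 0.5^2)$, for which the ideal noise predictor has a closed form. The hypothetical `eps_model` stands in for a trained network, and $\sigma_t^2 = \beta_t$ is one of the two variance choices discussed by Ho et al. (2020).

```python
import math
import random

T = 200
betas = [5e-4 + (0.1 - 5e-4) * i / (T - 1) for i in range(T)]  # assumed schedule
abar, prod = [], 1.0
for b in betas:
    prod *= 1.0 - b
    abar.append(prod)

M, S = 2.0, 0.5  # toy data distribution N(M, S^2)

def eps_model(xt, t):
    """Closed-form E[eps | x_t] for Gaussian data (stand-in for a network)."""
    ab = abar[t - 1]
    var_t = ab * S * S + (1.0 - ab)
    return math.sqrt(1.0 - ab) * (xt - math.sqrt(ab) * M) / var_t

def ddpm_sample(rng):
    """Ancestral sampling from t = T down to t = 1."""
    x = rng.gauss(0.0, 1.0)  # x_T ~ N(0, 1)
    for t in range(T, 0, -1):
        beta_t = betas[t - 1]
        eps = eps_model(x, t)
        mean = (x - beta_t / math.sqrt(1.0 - abar[t - 1]) * eps) / math.sqrt(1.0 - beta_t)
        z = rng.gauss(0.0, 1.0) if t > 1 else 0.0  # no noise on the final step
        x = mean + math.sqrt(beta_t) * z           # sigma_t^2 = beta_t
    return x
```

Drawing a few hundred samples with `ddpm_sample(random.Random(0))` should reproduce the toy data's mean and standard deviation closely.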
6.3 Advanced Sampling Methods
| Method | Steps | Approach |
|---|---|---|
| DDPM | 1000 | Original stochastic sampling |
| DDIM | 50-100 | Deterministic, skip steps |
| [[DPM-Solver]] | 10-20 | ODE solver with adaptive steps |
| [[DPM-Solver]]++ | 10-15 | Improved stability |
| UniPC | 5-10 | Unified predictor-corrector |
| Consistency Models | 1-5 | Direct mapping, distillation |
Predictor-Corrector Framework (for [[Stochastic Differential Equation (SDE)|SDE]]-based models):
- Predictor: Take one step using reverse [[Stochastic Differential Equation (SDE)|SDE]]/ODE
- Corrector: Apply Langevin dynamics to refine sample
- Repeat: Alternate for better quality
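The predictor-corrector alternation can be sketched on a one-dimensional toy problem with a known score. The example assumes Gaussian data $\mathcal{N}(2, 0.5^2)$ under a VP-SDE with linear $\beta(t)$ from 0.1 to 20. The corrector step size is set to a fixed fraction of the marginal variance (`step_frac`), which is a simplification chosen for this sketch and not the signal-to-noise rule used in published samplers.

```python
import math
import random

M, S = 2.0, 0.5  # toy data distribution N(M, S^2)

def beta(t):
    return 0.1 + 19.9 * t

def marginal(t):
    """Mean and variance of p_t for the toy data under the VP-SDE."""
    ab = math.exp(-(0.1 * t + 0.5 * 19.9 * t * t))
    return math.sqrt(ab) * M, ab * S * S + 1.0 - ab

def score(x, t):
    """Exact score of the Gaussian marginal (stand-in for s_theta)."""
    mu, var = marginal(t)
    return -(x - mu) / var

def pc_sample(rng, n_steps=200, n_corrector=1, step_frac=0.05):
    dt = 1.0 / n_steps
    x = rng.gauss(0.0, 1.0)
    for i in range(n_steps, 0, -1):
        t, t_next = i * dt, (i - 1) * dt
        b = beta(t)
        # Predictor: one Euler-Maruyama step of the reverse SDE
        x += (0.5 * b * x + b * score(x, t)) * dt + math.sqrt(b * dt) * rng.gauss(0.0, 1.0)
        # Corrector: Langevin dynamics targeting the marginal at the new time
        for _ in range(n_corrector):
            step = step_frac * marginal(t_next)[1]
            x += step * score(x, t_next) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
    return x
```

Setting `n_corrector=0` recovers a plain reverse-SDE sampler, which makes it easy to compare the two on the same seed.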
7. Advantages and Disadvantages
Advantages
- High generation quality: Reaches or exceeds GAN level
- Stable training: No mode collapse problem like GAN
- Elegant theory: Grounded in non-equilibrium thermodynamics and [[Stochastic Differential Equation (SDE)|SDE]] theory
- Flexible conditioning: Easy to incorporate text, image, or other conditions
- Coverage: Better mode coverage than GANs (less mode collapse)
- Likelihood estimation: Can compute exact likelihoods (via ODE)
Disadvantages
- Slow sampling speed: Requires tens to hundreds of iterative steps
- Sensitive to hyperparameters: Noise schedule affects generation quality
- High computational cost: Training requires significant resources
- Blurriness: May produce blurry samples compared to GANs (in pixel space)
Acceleration Methods
Algorithm-Level:
- DDIM (Deterministic sampling)
- [[DPM-Solver]] (Ordinary differential equation solver)
- Progressive Distillation
- Consistency Models (One-step generation)
Architecture-Level:
- Latent space diffusion (LDM)
- Distilled models (smaller, faster)
- Quantization and pruning
Hardware-Level:
- GPU optimization
- Parallel sampling
- Mixed precision training
8. Applications in AI Image Generation
| Application | Representative Models | Features |
|---|---|---|
| Text-to-Image | DALL-E 2, Stable Diffusion, Imagen | Diffusion model + text encoder (CLIP or T5), often in latent space |
| Image Editing | InstructPix2Pix, Prompt-to-Prompt | Conditional guided editing |
| Video Generation | Stable Video Diffusion | Introduce temporal dimension |
| 3D Generation | DreamFusion, Magic3D | Score Distillation Sampling (SDS) |
| Image Super-Resolution | SR3 | Iterative refinement conditioned on the low-resolution image |
| Inpainting | Stable Diffusion, GLIDE, RePaint | Mask-guided generation |
| Style Transfer | StyleDrop, Custom Diffusion | Style adaptation |
| Controlled Generation | ControlNet, T2I-Adapter | Spatial control signals |
8.1 Text-to-Image Generation
Architecture:
- Text Encoder: CLIP, T5, or custom transformer
- Conditioning: Cross-attention in U-Net/DiT
- Diffusion: Latent space denoising
- Decoder: VAE decoder to pixel space
Training Data: LAION-5B, COCO, internal datasets
8.2 Image-to-Image Translation
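Given a source image $x_{\text{src}}$, generate a target image that keeps its structure while following a new condition $c$ (e.g., a text prompt or control signal).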
Methods:
- Img2Img: Add noise to source, then denoise with conditioning
- ControlNet: Copy and adapt U-Net weights for control
- IP-Adapter: Image prompt adapter for visual conditioning
9. Conditional Diffusion Models
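9.1 Classifier Guidance
Sampling is steered by adding the gradient of a classifier $p_\phi(y \mid x_t)$, trained on noisy inputs, to the unconditional score:
$$\tilde{s}(x_t, y) = \nabla_{x_t} \log p(x_t) + s\,\nabla_{x_t} \log p_\phi(y \mid x_t)$$
where $s$ is the guidance scale.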
Pros:
- Works with pre-trained classifiers
- Flexible guidance strength
Cons:
- Requires training separate classifier
- Limited to classification conditions
9.2 Classifier-Free Guidance
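$$\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w\left(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\right)$$
where $c$ is the condition, $\varnothing$ is the null (dropped) condition, and $w$ is the guidance strength ($w = 1$ recovers the plain conditional prediction, $w > 1$ strengthens the condition).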
Training: Randomly drop condition (e.g., 10% probability) during training to learn unconditional model.
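Both halves of classifier-free guidance fit in a few lines: condition dropout at training time and a linear combination of two noise predictions at sampling time. The sketch uses the convention $\tilde{\epsilon} = \epsilon_{\text{uncond}} + w\,(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})$. The function names and the use of `None` as the null condition are illustrative assumptions.

```python
import random

def maybe_drop_condition(cond, p_drop=0.1, rng=random, null_token=None):
    """Training time: replace the condition by a null token with probability p_drop."""
    return null_token if rng.random() < p_drop else cond

def cfg_combine(eps_uncond, eps_cond, w):
    """Sampling time: extrapolate from the unconditional toward the conditional prediction."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_u = [0.0, 1.0]
eps_c = [1.0, 3.0]
guided = cfg_combine(eps_u, eps_c, w=2.0)  # [2.0, 5.0]
```

Note that other references write the same rule as $(1 + w)\,\epsilon_{\text{cond}} - w\,\epsilon_{\text{uncond}}$, which shifts the meaning of $w$ by one.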
Pros:
- No separate classifier needed
- Works with any condition type (text, image, etc.)
- Better quality than classifier guidance
Cons:
- One network must learn both conditional and unconditional predictions, and sampling needs two forward passes per step
- Guidance strength $w$ needs tuning
9.3 Multi-Modal Conditioning
Modern diffusion models support multiple conditions:
| Condition Type | Encoding Method | Integration |
|---|---|---|
| Text | CLIP, T5 transformer | Cross-attention |
| Image | CLIP vision, VAE encoder | Concatenation, attention |
| Depth/Edges | CNN encoder | ControlNet, adapter |
| Pose/Skeleton | Graph neural network | Spatial injection |
| Audio | VGGish, CLAP | Cross-attention |
9.4 Controllability Methods
ControlNet:
- Clone U-Net encoder layers
- Train with zero convolution initialization
- Lock original model, train control branches
IP-Adapter:
- Add image encoder parallel to text encoder
- Use decoupled cross-attention
- Enables image prompt guidance
10. Theoretical Analysis
10.1 Connection to Variational Inference
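Diffusion models optimize the variational lower bound (ELBO):
$$\log p_\theta(x_0) \ge \mathbb{E}_q\left[\log p_\theta(x_0 \mid x_1)\right] - \sum_{t=2}^{T} \mathbb{E}_q\left[D_{\text{KL}}\!\left(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\right)\right] - D_{\text{KL}}\!\left(q(x_T \mid x_0)\,\|\,p(x_T)\right)$$
Interpretation:
- Term 1: Reconstruction loss
- Term 2: Consistency between forward and reverse processes
- Term 3: Prior matching (ensure $q(x_T \mid x_0)$ is close to Gaussian)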
10.2 Connection to Score Matching
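Score matching objective:
$$J(\theta) = \tfrac{1}{2}\,\mathbb{E}_{p(x)}\left[\left\| s_\theta(x) - \nabla_x \log p(x) \right\|^2\right]$$
For diffusion models, this becomes denoising score matching. Since $\nabla_{x_t} \log q(x_t \mid x_0) = -\epsilon / \sqrt{1 - \bar{\alpha}_t}$, the objective is
$$\mathbb{E}_{t, x_0, \epsilon}\left[\lambda(t)\left\| s_\theta(x_t, t) + \frac{\epsilon}{\sqrt{1 - \bar{\alpha}_t}} \right\|^2\right]$$
which matches noise prediction under $s_\theta(x_t, t) = -\epsilon_\theta(x_t, t) / \sqrt{1 - \bar{\alpha}_t}$.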
10.3 Neural Tangent Kernel (NTK) Analysis
In the infinite-width limit, diffusion model training can be analyzed via NTK:
- Training dynamics: Governed by kernel regression
- Generalization: Related to kernel eigenvalues
- Mode coverage: Depends on data spectrum
10.4 Information Bottleneck Perspective
Forward diffusion as information bottleneck:
- Early timesteps: High mutual information (preserve details)
- Late timesteps: Low mutual information (only semantic info)
- Optimal schedule balances compression and preservation
11. Core Formula Cards
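[!QUOTE] Reparameterization Noising
$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
[!QUOTE] DDPM Loss
$$L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2\right]$$
[!QUOTE] DDIM Sampling
$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\left(\frac{x_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}\right) + \sqrt{1 - \bar{\alpha}_{t-1}}\;\epsilon_\theta(x_t, t) \quad (\eta = 0)$$
[!QUOTE] Classifier-Free Guidance
$$\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w\left(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\right)$$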
12. Evaluation Metrics
12.1 Sample Quality
| Metric | Description | Range |
|---|---|---|
| FID | Fréchet Inception Distance | Lower is better (0 is perfect) |
| IS | Inception Score | Higher is better |
| Precision/Recall | Quality vs. diversity trade-off | [0, 1] |
| KID | Kernel Inception Distance | Lower is better |
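FID Formula:
$$\text{FID} = \left\| \mu_r - \mu_g \right\|^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$$
where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of Inception features for real and generated images, respectively.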
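The Fréchet distance between two Gaussians, $\|\mu_r - \mu_g\|^2 + \operatorname{Tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2})$, simplifies when both covariances are diagonal, because the matrix square root reduces to elementwise square roots. The sketch below covers only that special case, as an illustration of the formula. Real FID uses full covariance matrices of Inception features and a matrix square root.

```python
import math

def fid_diagonal(mu_r, var_r, mu_g, var_g):
    """Frechet distance between two Gaussians with diagonal covariances."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum(vr + vg - 2.0 * math.sqrt(vr * vg) for vr, vg in zip(var_r, var_g))
    return mean_term + cov_term

same = fid_diagonal([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0])     # 0.0
shifted = fid_diagonal([0.0, 1.0], [1.0, 2.0], [1.0, 1.0], [1.0, 2.0])  # mean term only
```

The distance is zero only when both the means and the variances agree, so it penalizes both quality and diversity mismatches.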
12.2 Likelihood Evaluation
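Bits per dimension (bpd):
$$\text{bpd} = -\frac{\log_2 p_\theta(x)}{D}$$
where $D$ is the number of data dimensions (e.g., $3 \times 32 \times 32 = 3072$ for CIFAR-10 images).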
Lower bpd indicates better likelihood.
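Likelihoods are usually computed in nats, so reporting bpd requires dividing by $D \ln 2$. A small sketch, with the function name `bits_per_dim` chosen for illustration:

```python
import math

def bits_per_dim(nll_nats, num_dims):
    """Convert a negative log-likelihood in nats to bits per dimension."""
    return nll_nats / (num_dims * math.log(2))

D = 3 * 32 * 32                       # a 32x32 RGB image
nll = 3.0 * D * math.log(2)           # an NLL chosen to equal exactly 3 bits per dimension
print(round(bits_per_dim(nll, D), 6))  # 3.0
```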
12.3 Diversity Metrics
- Mode coverage: Percentage of data modes captured
- LPIPS: Learned Perceptual Image Patch Similarity (diversity)
- Unique samples: Ratio of unique generated samples
12.4 Human Evaluation
- User studies: Preference ratings
- Text-image alignment: CLIP score for text-to-image
- Aesthetic quality: Aesthetic score predictors
13. Practical Implementation Tips
13.1 Network Architecture
Key Components of the U-Net Design:
- ResNet blocks: Groups of 2-3 conv layers with skip connections
- Attention: Self-attention at low-resolution feature maps (e.g., 16x16 or 32x32)
- Time embedding: Sinusoidal position encoding → MLP
- Conditioning: Cross-attention for text, AdaGN for class labels
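The sinusoidal time embedding can be sketched as follows. The layout (all sines followed by all cosines) and `max_period=10000` follow a common convention and are assumptions here. In a full model the output would be passed through an MLP before being injected into the network.

```python
import math

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of a scalar timestep; dim is assumed to be even."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb = timestep_embedding(10, 8)  # 8-dimensional embedding of t = 10
```

Low-index entries oscillate quickly with $t$ and high-index entries slowly, so nearby timesteps get similar but distinguishable embeddings.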
13.2 Training Best Practices
Hyperparameters:
| Parameter | Recommended Value | Notes |
|---|---|---|
| Timesteps | 1000 | Standard, can use fewer for fast sampling |
| Batch size | 256-512 | Larger is better if memory allows |
| Learning rate | 1e-4 | Use cosine decay schedule |
| Optimizer | AdamW | β₁=0.9, β₂=0.999 |
| EMA rate | 0.9999 | Exponential moving average |
| Gradient clipping | 1.0 | Prevents explosion |
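The EMA rate in the table refers to a shadow copy of the weights that is updated after every optimizer step and used for sampling. A minimal sketch over a flat list of parameters follows. The function name and the list representation are assumptions for illustration.

```python
def ema_update(ema_params, params, decay=0.9999):
    """Move each shadow parameter a small step toward the current parameter."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

ema = [0.0, 0.0]
weights = [1.0, -2.0]
for _ in range(100):
    ema = ema_update(ema, weights, decay=0.9)  # smaller decay so the effect is visible
```

With `decay=0.9999` the shadow weights average over roughly the last 10,000 steps, which is why EMA checkpoints lag early in training.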
Data Augmentation:
- Random horizontal flip
- Random crop and resize
- No color jitter (changes data distribution)
13.3 Debugging Checklist
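✓ Check noise schedule: Plot $\bar{\alpha}_t$ against $t$; it should decay monotonically from about 1 toward 0
✓ Monitor loss curves: Should decrease smoothly, no spikes
✓ Validate reparameterization: $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon$ should have mean $\sqrt{\bar{\alpha}_t}\,x_0$ and variance $1 - \bar{\alpha}_t$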
✓ Test sampling: Start with small model, verify basic functionality
✓ Check gradients: Norm should be reasonable (< 10)
✓ Visualize intermediates: Sample at different timesteps during training
13.4 Common Issues and Solutions
| Problem | Cause | Solution |
|---|---|---|
| Blurry samples | Undertraining, high noise | Train longer, check schedule |
| Mode collapse | Low capacity, overfitting | Increase model size, add dropout |
| Training instability | High learning rate | Reduce LR, add gradient clipping |
| Slow sampling | Too many timesteps | Use DDIM, [[DPM-Solver]] |
| Poor conditioning | Weak guidance | Increase guidance strength $w$ |
14. Recent Advances (2023-2024)
14.1 Consistency Models
Key Idea: Learn direct mapping from noise to data in one step.
Benefits:
- 1-step generation (vs. 1000 steps)
- Distillation from pre-trained diffusion models
- Competitive quality with fewer steps
14.2 Rectified Flows
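Concept: Learn straight trajectories between noise and data.
$$x_t = (1 - t)\,z + t\,x, \qquad \min_\theta\ \mathbb{E}_{z, x, t}\left\| (x - z) - v_\theta(x_t, t) \right\|^2$$
where $z \sim \mathcal{N}(0, I)$ is a noise sample, $x$ is a data sample, $t \in [0, 1]$, and $v_\theta$ is the learned velocity field.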
Advantage: Fewer integration steps needed.
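Why straight paths need fewer steps can be seen directly: along $x_t = (1 - t)\,z + t\,x$ the velocity is the constant $x - z$, so a single Euler step is already exact. The sketch below uses the true velocity of one noise-data pair in place of a learned $v_\theta$, which is an idealization for illustration.

```python
def interpolate(z, x, t):
    """Point on the straight path from noise z (t=0) to data x (t=1)."""
    return [(1.0 - t) * zi + t * xi for zi, xi in zip(z, x)]

def target_velocity(z, x):
    """Regression target for the velocity field along the straight path."""
    return [xi - zi for zi, xi in zip(z, x)]

def euler_integrate(z, velocity_fn, n_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x, dt = list(z), 1.0 / n_steps
    for k in range(n_steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

z, x = [0.5, -1.0], [2.0, 3.0]
v_true = target_velocity(z, x)
one_step = euler_integrate(z, lambda x_t, t: v_true, n_steps=1)  # lands on x
```

A learned field is only approximately straight, so in practice a handful of steps, or a further "reflow" stage, is used.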
14.3 Diffusion Transformers (DiT)
- Replace U-Net with Transformer
- Scale to billions of parameters
- Better performance with larger models
- Used in Sora, Stable Diffusion 3
14.4 [[Flow Matching]]
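General framework encompassing diffusion models:
$$\mathcal{L}_{\text{FM}} = \mathbb{E}_{t,\ x \sim p_t}\left\| v_\theta(x, t) - u_t(x) \right\|^2$$
where $u_t$ is the target vector field that generates the probability path $p_t$ from noise to data, and $v_\theta$ is the learned vector field.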
Unifies:
- Diffusion models
- Normalizing flows
- Continuous normalizing flows
14.5 Video Diffusion Models
Challenges:
- Temporal consistency
- Computational cost (3D + time)
- Long sequence generation
Solutions:
- Spatiotemporal attention
- Cascaded generation
- Latent video diffusion
15. Comparison with Other Generative Models
| Model | Quality | Diversity | Training Stability | Sampling Speed | Likelihood |
|---|---|---|---|---|---|
| GAN | ★★★★☆ | ★★☆☆☆ | ★☆☆☆☆ | ★★★★★ | ✗ |
| VAE | ★★☆☆☆ | ★★★★☆ | ★★★★★ | ★★★★★ | ✓ |
| Diffusion | ★★★★★ | ★★★★★ | ★★★★★ | ★★☆☆☆ | ✓ |
| Flow | ★★★☆☆ | ★★★★☆ | ★★★★☆ | ★★★☆☆ | ✓ |
| EBM | ★★★★☆ | ★★★☆☆ | ★★☆☆☆ | ★☆☆☆☆ | ✓ |
When to use Diffusion Models:
- ✓ Need high-quality samples
- ✓ Mode coverage is important
- ✓ Training stability is critical
- ✗ Real-time generation required
- ✗ Limited computational resources
Related Concepts
- [[Wiener Process]]
- [[Stochastic Differential Equation (SDE)|SDE]]
- [[Score Function]]
- [[Probability Flow ODE]]
- [[Fokker-Planck Equation]]
- [[Kolmogorov Equations]]
- [[DDIM]]
- [[DPM-Solver]]
- [[Flow Matching]]
- [[Markov Process]]
- [[Neural ODE]]
- [[ResNet]]
- [[U-Net]]
- [[DiT]]
- [[Vision Transformer (ViT)]]
- [[Variational Autoencoder (VAE)]]
- [[Generative Adversarial Network (GAN)]]
- [[Langevin Dynamics]]
- [[Denoising Score Matching]]
- [[Itô Integral]]
- [[Martingale]]
References
- Paper: Denoising Diffusion Probabilistic Models (Ho et al., 2020)
- Paper: Score-Based Generative Modeling through SDEs (Song et al., 2021)
- Paper: High-Resolution Image Synthesis with Latent Diffusion Models (Rombach et al., 2022)
- Blog: What are Diffusion Models? (Lilian Weng)
- Course: CS236 Deep Generative Models (Stanford)